Presbyopic branch target prefetch method and apparatus

ABSTRACT

An instruction prefetch apparatus includes a branch target buffer (BTB), a presbyopic target buffer (PTB) and a prefetch stream buffer (PSB). The BTB includes records that map branch addresses to branch target addresses, and the PTB includes records that map branch target addresses to subsequent branch target addresses. When a branch instruction is encountered, the BTB can predict the dynamically adjacent subsequent block entry location as the branch target address in the record that also includes the branch instruction address. The PTB can predict multiple subsequent blocks by mapping the branch target address to subsequent dynamic blocks. The PSB holds instructions prefetched from subsequent blocks predicted by the PTB.

FIELD

[0001] The present invention relates generally to microprocessors, and more specifically to microprocessors employing branch target prediction and prefetch mechanisms.

BACKGROUND

[0002] Many modern microprocessors have large instruction pipelines that facilitate high speed operation. “Fetched” program instructions enter the pipeline, undergo operations such as decoding and executing in intermediate stages of the pipeline, and are “retired” at the end of the pipeline. When the pipeline receives a valid instruction each clock cycle, the pipeline remains full and performance is good. When valid instructions are not received each cycle, the pipeline does not remain full, and performance can suffer. For example, performance problems can result from branch instructions in program code. If a branch instruction is encountered in the program and the processing branches to the target address, a portion of the instruction pipeline may have to be flushed, resulting in a performance penalty.

[0003] Branch Target Buffers (BTB) have been devised to lessen the impact of branch instructions on pipeline efficiency. A discussion of BTBs can be found in: David A. Patterson & John L. Hennessy, Computer Architecture: A Quantitative Approach 271-275 (2d ed. 1990). A typical BTB application is also shown in FIG. 1A. FIG. 1A shows BTB 10 coupled to instruction pointer (IP) 18, and processor pipeline 20. Also included in FIG. 1A are cache 30 and fetch buffer 32.

[0004] The location of the next instruction to be fetched is specified by IP 18. As execution proceeds in sequential order in a program, IP 18 increments each cycle. The output of IP 18 drives port 34 of cache 30 and specifies the address from which the next instruction is to be fetched. Cache 30 provides the instruction to fetch buffer 32, which in turn provides the instruction to processor pipeline 20. Fetch buffer 32 typically has a latency associated therewith, herein referred to as “Icache latency.”

[0005] When instructions are received by pipeline 20, they proceed through several stages shown as fetch stage 22, decode stage 24, intermediate stages 26, and retire stage 28. Information on whether a branch instruction results in a taken branch is typically not available until a later pipeline stage, such as retire stage 28. When BTB 10 is not present and a branch is taken, fetch buffer 32 and the portion of instruction pipeline 20 following the branch instruction hold instructions from the wrong execution path. The invalid instructions in processor pipeline 20 and fetch buffer 32 are flushed, and IP 18 is written with the branch target address. A performance penalty results, in part because the processor waits while fetch buffer 32 and instruction pipeline 20 are filled with instructions starting at the branch target address. The performance penalty is roughly equal to the sum of the Icache latency and the processor pipeline latency.

[0006] Branch target buffers lessen the performance impact of taken branches. BTB 10 includes records 11, each having a branch address (BA) field 12 and a target address (TA) field 14. TA field 14 holds the branch target address for the branch instruction located at the address specified by the corresponding BA field 12. When a branch instruction is encountered by processor pipeline 20, the BA fields 12 of records 11 are searched for a record matching the address of the branch instruction. If found, IP 18 is changed to the value of the TA field 14 corresponding to the found BA field 12. As a result, instructions are next fetched starting at the branch target address. This mechanism is commonly referred to as “branch target prefetch.”
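
By way of illustration only, the BTB lookup just described can be modeled in software. The following is a minimal sketch, not a description of the hardware: the addresses, the fixed instruction size `INSTR_SIZE`, and the function name `next_fetch_address` are all hypothetical and chosen only for the example.

```python
# Hypothetical model of BTB 10: each record pairs a branch address (BA)
# with a target address (TA). The addresses are made up for illustration.
BTB = {0x1010: 0x2000, 0x2024: 0x3000}

INSTR_SIZE = 4  # assumed fixed instruction size; not specified in the text


def next_fetch_address(ip, btb):
    """Return the next IP value: the cached target on a BTB hit,
    otherwise the next sequential instruction address."""
    target = btb.get(ip)
    if target is not None:
        return target          # hit: redirect fetch to the branch target
    return ip + INSTR_SIZE     # miss: IP simply increments


print(hex(next_fetch_address(0x1010, BTB)))
print(hex(next_fetch_address(0x1000, BTB)))
```

A dictionary stands in for the associative search of the BA fields; a real buffer would be a fixed-size array with a replacement policy.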

[0007] Branch target prefetch can occur while the branch instruction is still early in the processor pipeline, such as in decode stage 24. In this case, when the predicted branch is actually taken, the latency is reduced from the sum of the Icache latency and the processor pipeline latency described above; however, the penalty associated with fetch buffer 32 (Icache latency) remains.

[0008] The latency associated with the use of BTB 10 is shown in FIG. 1B. In region 60, the processor pipeline has filled, and performance is good. In region 70, a branch is taken, and the fetch buffer is flushed and refilled. As shown in region 70, performance drops as the pipeline is flushed, and then performance is regained as the pipeline is refilled. Performance drops during latency period 50. Latency period 50 is a function of the fetch buffer depth and the relative speeds of the processor pipeline and the cache. As the processor pipeline increases in speed, latency period 50 increases when expressed as a number of cycles.

[0009] For the reasons stated above, and for other reasons stated below which will become apparent to those skilled in the art upon reading and understanding the present specification, there is a need in the art for an alternate method and apparatus for branch target prefetch.

BRIEF DESCRIPTION OF THE DRAWINGS

[0010] FIG. 1A is a prior art branch target prefetch mechanism;

[0011] FIG. 1B shows prior art performance during a branch target prefetch;

[0012] FIG. 2 shows a processor including a presbyopic branch target prefetch mechanism;

[0013] FIG. 3 shows a software control flow graph;

[0014] FIG. 4A shows a branch target buffer and a presbyopic target buffer in accordance with an embodiment of the invention;

[0015] FIG. 4B shows a branch target buffer and a presbyopic target buffer in accordance with another embodiment of the invention;

[0016] FIG. 5 shows a prefetch stream buffer;

[0017] FIG. 6 shows a series of function calls and returns; and

[0018] FIG. 7 shows a return stack buffer and a presbyopic return stack buffer.

DESCRIPTION OF EMBODIMENTS

[0019] In the following detailed description of the embodiments, reference is made to the accompanying drawings that show, by way of illustration, specific embodiments in which the invention may be practiced. In the drawings, like numerals describe substantially similar components throughout the several views. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. Other embodiments may be utilized and structural, logical, and electrical changes may be made without departing from the scope of the present invention. Moreover, it is to be understood that the various embodiments of the invention, although different, are not necessarily mutually exclusive. For example, a particular feature, structure, or characteristic described in one embodiment may be included within other embodiments. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims, along with the full scope of equivalents to which such claims are entitled.

[0020] FIG. 2 shows a processor including a presbyopic branch target prefetch mechanism. Apparatus 200 includes branch target buffer (BTB) 205, presbyopic target buffer (PTB) 210, and cache memory 220. Apparatus 200 also includes fetch buffer (FB) 32, target block prefetch stream buffer (PSB) 250, multiplexer 260, and processor pipeline 20. BTB 205 receives branch instruction addresses from processor pipeline 20 on node 202 and predicts branch target addresses. Branch target addresses are provided to port 222 of cache 220, and processor instructions starting at the target address are provided to FB 32.

[0021] BTB 205 also provides the branch target address to PTB 210 on node 207. In some embodiments, PTB 210, unlike BTB 205, maps branch target addresses to subsequent branch target addresses. For example, whereas BTB 205 maps an exit IP from a block to an entrance IP of a subsequent block, PTB 210 maps an entrance IP of a block to an entrance IP of a subsequent block. Embodiments of PTB 210 are described more fully with reference to FIGS. 4A and 4B below.

[0022] In some embodiments, PTB 210 can be recursively searched as shown by node 209 in FIG. 2. Recursive searching in PTB 210 is also more fully explained below. PTB 210 provides the entrance IP of a subsequent block to cache 220 on port 224. Instructions fetched from the subsequent block entrance IP are provided to PSB 250. PSB 250 is a prefetch stream buffer capable of holding instructions prefetched as a result of the operation of PTB 210. In some embodiments, PSB 250 is at least as long as the Icache latency so that when FB 32 is flushed as a result of a branch, prefetched instructions can be provided from PSB 250 through multiplexer 260.

[0023] PTB 210 operates as a “far-sighted” branch target buffer through the various mapping schemes and recursive searches employed. Branch targets and subsequent blocks that dynamically reside multiple blocks in the future can be predicted by PTB 210, as is explained in more detail below. PTB 210 is referred to as “presbyopic” to reflect the far-sighted nature of its operation, and to differentiate PTB 210 from BTB 205.

[0024] FIG. 3 shows a software control flow graph. Control flow graph 300 shows blocks 301, 310, 320, 330, 340, and 350, which represent software code regions, or “blocks,” each including a block entrance instruction, intermediate instructions, and a block exit instruction. Each of these instructions occurs at a location specified at runtime as an IP value. For example, block 301 has an instruction labeled “a.in” at entrance IP 302, has an instruction labeled “a.out” at exit IP 306, and has intermediate instructions 304 therebetween. The entrance IP of each block can be a target address for a branch instruction. For example, in block 301, when “a.out” is a branch instruction, “b.in” is the target address of the branch instruction. In general, target addresses of branch instructions correspond to entrance IPs of subsequent blocks.

[0025] Blocks that occur later in the control flow are termed “subsequent blocks.” For example, blocks 310 and 320 are blocks subsequent to block 301. Likewise, blocks 330, 340, and 350 are blocks subsequent to blocks 301, 310, and 320. “Dynamically adjacent” subsequent blocks are blocks that execute one after another. For example, blocks 320 and 330 are dynamically adjacent, but blocks 310 and 330 are not. Dynamically adjacent blocks are not necessarily physically adjacent in the program code.

[0026] Blocks 320, 330, 340, and 350 form a “hammock.” A hammock occurs when the control flow can branch to different subsequent blocks, and the different subsequent blocks return control to a common subsequent block. For example, in control flow graph 300, block 320 can branch to either block 330 or block 340. In other words, the “c.out” instruction at exit IP 326 can have a target address that resolves to either entrance IP 332 or entrance IP 342. When the target address resolves to entrance IP 332, control flow branches to block 330, and instructions beginning with “d.in” are executed. In contrast, when the target address resolves to entrance IP 342, control flow branches to block 340, and instructions beginning with “e.in” are executed.

[0027] The hammock is formed because both blocks 330 and 340 branch to block 350. For example, in block 330, the “d.out” instruction at exit IP 336 branches to the “f.in” instruction at entrance IP 352. Likewise, in block 340, the “e.out” instruction at exit IP 346 also branches to the “f.in” instruction at entrance IP 352. Even if branch prediction from block 320 to either block 330 or 340 is unreliable, predicting block 350 as a block subsequent to block 320 may be reliable as a result of the hammock.
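
The effect of the hammock on predictability can be illustrated with a small software model of control flow graph 300. The sketch below is illustrative only; the edge list is transcribed from the description of FIG. 3, blocks are keyed by their letter, and the helper name `blocks_after` is hypothetical.

```python
# Successor edges of control flow graph 300, keyed by block letter
# (a=301, b=310, c=320, d=330, e=340, f=350), as described in the text.
CFG = {"a": ["b"], "b": ["c"], "c": ["d", "e"], "d": ["f"], "e": ["f"], "f": []}


def blocks_after(cfg, block, steps):
    """Return the set of blocks that can be reached from `block`
    after exactly `steps` dynamic block transitions."""
    frontier = {block}
    for _ in range(steps):
        frontier = {succ for b in frontier for succ in cfg[b]}
    return frontier


print(blocks_after(CFG, "c", 1))  # dynamically adjacent: two candidates
print(blocks_after(CFG, "c", 2))  # two blocks away: a single candidate
```

One transition from block "c" leaves two possible outcomes, while two transitions leave exactly one, which is why a prediction that skips the adjacent block can be the more reliable one.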

[0028] As just described, hammocks can create a scenario where prediction may be more reliable when predicted subsequent blocks are not dynamically adjacent, but are instead more than one dynamic block away. “Skip-adjacent” prediction can be used to reliably predict subsequent blocks more than one dynamic block away. FIG. 4A gives one example of skip-adjacent prediction.

[0029] FIG. 4A shows a branch target buffer and a presbyopic target buffer in accordance with an embodiment of the invention. Embodiment 400 includes branch target buffer (BTB) 205, and presbyopic target buffer (PTB) 210. BTB 205 is an array that caches records 412. Each record 412 is organized into fields which include branch address (BA) field 416, target address (TA) field 418, and confidence counter field (CC) 420. BA field 416 holds the address of branch instructions, and TA field 418 holds addresses of branch targets. For example, the first record shown in BTB 205 maps “a.out” to “b.in.” As shown in FIG. 3, “a.out” is a branch instruction at a block exit, and “b.in” is the first instruction at a subsequent block entrance.

[0030] BTB 205 receives the current instruction pointer on node 202 and performs a search of BA field 416. If a matching record is found, the current instruction pointer on node 202 points to a branch instruction and BTB 205 predicts the target address by sending the corresponding TA field 418 value out on node 207. For example, when node 202 has the address of “b.out” impressed thereon, BTB 205 will drive node 207 with the address of “c.in.”

[0031] BTB 205 also includes CC field 420. In some embodiments, CC field 420 includes a saturating counter that counts the number of times the cached branch is taken. For example, in embodiment 400, CC field 420 includes a 3 bit saturating counter. Each time the cached branch is taken, the saturating counter in CC field 420 is incremented. When the counter reaches the maximum value, the counter remains at the maximum value and no longer increments. Each time the branch is not taken, the saturating counter decrements. If a saturating counter drops below zero, then the confidence in the cached branch is eroded to the point that the corresponding record is removed from BTB 205. In some embodiments, CC field 420 is kept small, in part because BTB 205 can be on a critical path for instruction fetches. In embodiment 400, CC field 420 is shown as three bits wide. In this embodiment, eight consecutive non-taken branches will cause a record to be removed from BTB 205, even when its counter starts at the maximum value.
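
The behavior of the confidence counter can be sketched in software as follows. This is an illustrative model under stated assumptions, not the circuit itself: the class name `ConfidenceCounter`, the `update` method, and the initial counter value are hypothetical, since the text does not specify the value assigned to a newly installed record.

```python
class ConfidenceCounter:
    """Saturating confidence counter modeled on CC field 420."""

    def __init__(self, bits=3, start=0):
        self.max_value = (1 << bits) - 1  # 7 for a 3 bit counter
        self.value = start                # assumed initial value

    def update(self, taken):
        """Apply one branch outcome. Returns False when the counter
        would drop below zero, meaning the record should be removed."""
        if taken:
            # Increment, but stay at the maximum once saturated.
            self.value = min(self.value + 1, self.max_value)
            return True
        if self.value == 0:
            return False                  # would go below zero: evict
        self.value -= 1
        return True


counter = ConfidenceCounter(bits=3, start=7)   # start saturated
outcomes = [counter.update(False) for _ in range(8)]
print(outcomes)
```

Starting from the saturated value, the first seven non-taken outcomes bring the counter down to zero and the eighth triggers removal, matching the eight consecutive non-taken branches noted for the three bit field.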

[0032] PTB 210 includes records that map a branch target address (or an entrance to a block) to a subsequent branch target address (or to an entrance to a subsequent dynamic block). PTB 210 includes target address (TA) fields 424 and 426. PTB 210 receives the current instruction pointer value on node 418 and performs a search for a record having a matching value in TA field 424. When found, the contents of TA field 426 are driven on node 212. Node 212 can then be used to drive a cache port, such as cache port 224 (FIG. 2).

[0033] PTB searches and predictions can occur in the same clock cycle as the BTB search, or can be deferred to subsequent clock cycles. In addition, PTB lookup and prediction can be performed either upon the BTB lookup at the fetch stage in the front end of pipeline 20 (FIG. 2), or after the branch target address is actually resolved at the end of the pipeline.

[0034] PTB 210 can also perform recursive searches. A recursive search is performed when PTB 210 drives node 209 with a target address from TA field 426, and searches TA field 424. In this manner, PTB 210 can predict multiple subsequent dynamic blocks. For example, when block entrance IP 302 (FIG. 3) appears as an input to PTB 210 on node 418, PTB 210 matches the first record and drives node 209 with entrance IP 312, which is the address of the instruction “b.in.” PTB 210 receives this, performs a search, and finds a record that maps “b.in” to “c.in,” and drives the address of “c.in” on node 209. PTB 210 can then recursively search based on the address on node 209.
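
The recursive search can be illustrated with the following sketch, which feeds each predicted entrance back into the lookup, much as node 209 feeds TA field 426 back to the search of TA field 424. The record contents follow the examples in the text; the function name `recursive_search` and the `max_depth` bound are assumptions added for the illustration.

```python
# PTB records as (entrance -> subsequent entrance), using the labels of FIG. 3.
PTB = {"a.in": "b.in", "b.in": "c.in", "c.in": "f.in"}


def recursive_search(ptb, entry, max_depth):
    """Follow PTB records starting at `entry`, returning the list of
    predicted subsequent block entrances, at most `max_depth` long."""
    predicted = []
    current = entry
    while len(predicted) < max_depth:
        current = ptb.get(current)
        if current is None:      # no matching record: stop searching
            break
        predicted.append(current)
    return predicted


print(recursive_search(PTB, "a.in", 4))
```

The depth bound is a modeling convenience; it keeps the search finite when the control flow contains a loop.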

[0035] Searches can also include BTB 205 and PTB 210 in combination. When a branch instruction is pointed to by the current IP on node 202, BTB 205 will drive the corresponding target address (if a matching record is found) on node 207. PTB 210 receives the target address on node 207 and can use it to perform a search for a subsequent dynamic block. For example, if the location of “b.out” is on node 202, BTB 205 will drive node 207 with the location of “c.in.” PTB 210 receives the location of “c.in” on node 207, finds a matching record, and drives nodes 212 and 209 with the location of “f.in.” At this point, a further recursive search can take place.
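
A combined search of this kind can be sketched as follows. The model is illustrative only: the record contents follow the examples given for FIG. 4A, while the function name `combined_search` and the `max_depth` parameter are hypothetical.

```python
# Records following the examples given for FIG. 4A.
BTB = {"a.out": "b.in", "b.out": "c.in"}
PTB = {"a.in": "b.in", "b.in": "c.in", "c.in": "f.in"}


def combined_search(branch_ip, btb, ptb, max_depth=4):
    """Look up the branch in the BTB, then use the predicted target
    (node 207) to start a PTB search for blocks further ahead."""
    target = btb.get(branch_ip)
    if target is None:
        return None, []              # current IP is not a cached branch
    lookahead = []
    current = target
    for _ in range(max_depth):
        current = ptb.get(current)
        if current is None:
            break
        lookahead.append(current)    # candidate for prefetch into the PSB
    return target, lookahead


print(combined_search("b.out", BTB, PTB))
```

In this model the first element of the result corresponds to the ordinary BTB prediction, and the list corresponds to the farther-sighted predictions that would drive the prefetch port.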

[0036] Recursive searches of PTB 210, and searches utilizing both BTB 205 and PTB 210 result in “domino prediction.” Domino prediction occurs when multiple subsequent dynamic blocks are predicted. PTB 210 can perform domino prediction off the critical path, and can cause the prefetch of instructions from subsequent blocks more than one dynamic block away. Referring now to FIG. 2, PTB 210 is shown driving port 224 of cache 220. Cache 220 sends prefetched instructions to PSB 250. When performing domino prediction, PSB 250 can include instructions prefetched from multiple predicted subsequent dynamic blocks. PSB 250 can include all of the instructions from a predicted block, or can include a subset of the predicted block.

[0037] Some embodiments support multi-way domino prediction. For branches that are likely to frequently take multiple dynamic targets, multiple target basic blocks can be captured via associating a single BTB record with multiple PTB records. In some embodiments, BTB 205 and PTB 210 include index fields for the association, and in other embodiments, multiple PTBs are implemented.

[0038] PTB 210 also includes confidence counter (CC) field 428. CC field 428 operates in a manner similar to that of CC field 420 of BTB 205. Each time a predicted branch is actually taken, the corresponding CC field 428 is incremented unless saturated, and each time the predicted branch is not taken, CC field 428 is decremented. In some embodiments, the confidence counter of PTB 210 is larger than confidence counters used in BTB 205. Because PTB 210 performs subsequent block prediction well in advance of the actual execution of the predicted subsequent block, PTB 210 is not on the critical path. More time can be taken to increment and decrement confidence counters, and so CC field 428 can be large. A large CC field 428 in PTB 210 can increase the accuracy of subsequent dynamic block prediction.

[0039] PTB 210 can also perform skip-adjacent prediction. Record 434 within PTB 210 is an example of a PTB record that performs skip-adjacent prediction. Record 434 maps the location of instruction “c.in” to the location of instruction “f.in.” This corresponds to mapping entrance IP 322 of block 320 to entrance IP 352 of block 350 (FIG. 3). When this prediction occurs, PSB 250 can include instructions from block 320 (“c” instructions) and instructions from block 350 (“f” instructions) without including any instructions from either block 330 or 340 (“d” or “e” instructions). This is an example of skip-adjacent prediction because blocks dynamically adjacent to block 320 are skipped in favor of a subsequent block occurring later in the control flow.

[0040] When BTB 205 is searched, and a matching record is found, the current IP specifies the location of a branch instruction. If BTB 205 and PTB 210 are populated with records that correctly predict the branches taken on the current control flow, the instruction located at the predicted target address and its subsequent instructions are likely already in PSB 250 (FIG. 2), because this current branch has likely been predicted previously as a domino prediction.

[0041] In some embodiments, BTB 205 and PTB 210 share a single target address array. For example, BTB 205 includes a record that maps “b.out” to “c.in,” and PTB 210 includes a record that maps “b.in” to “c.in.” The “c.in” target field value is common to both BTB 205 and PTB 210, and can be shared.

[0042] In some embodiments, domino prediction is performed in a “disjoint eager” fashion, in which a confidence gauge is associated with each branch prediction made speculatively along a dynamic path. As predictions are made further along the path, the confidence of the prediction degrades. As prediction confidence degrades, multiple alternative targets can be fetched instead of choosing a single path. When disjoint eager domino prediction is performed, instructions can be prefetched into PSB 250 (FIG. 2) from multiple disjoint paths.
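
One possible reading of disjoint eager prediction is sketched below. The text does not define how the confidence gauge is computed, so everything specific here is an assumption made for illustration: per-target confidences are treated as probability-like values, path confidence is their product, and a fixed `threshold` decides when to stop following a single path. The multi-target table layout and the name `disjoint_eager` are likewise hypothetical.

```python
# Hypothetical multi-target PTB: entrance -> [(next entrance, confidence)].
PTB = {
    "b.in": [("c.in", 0.95)],
    "c.in": [("d.in", 0.55), ("e.in", 0.45)],
    "d.in": [("f.in", 0.90)],
    "e.in": [("f.in", 0.90)],
}


def disjoint_eager(ptb, entry, threshold, max_depth):
    """Return block entrances to prefetch. While the accumulated path
    confidence stays at or above `threshold`, only the best target is
    followed; once it falls below, all alternatives are fetched."""
    prefetch = []
    frontier = {entry: 1.0}          # block entrance -> path confidence
    for _ in range(max_depth):
        next_frontier = {}
        for block, path_conf in frontier.items():
            targets = sorted(ptb.get(block, []), key=lambda t: t[1], reverse=True)
            if not targets:
                continue
            confident = path_conf * targets[0][1] >= threshold
            chosen = targets[:1] if confident else targets
            for nxt, conf in chosen:
                best = max(next_frontier.get(nxt, 0.0), path_conf * conf)
                next_frontier[nxt] = best
        for name in next_frontier:
            if name not in prefetch:
                prefetch.append(name)
        frontier = next_frontier
    return prefetch


print(disjoint_eager(PTB, "b.in", threshold=0.6, max_depth=3))
```

With the made-up values above, the search follows a single path to "c.in" and then prefetches both arms of the hammock, because neither arm alone keeps the path confidence above the threshold.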

[0043] TA fields 424 and 426 can include the total number of bits needed to unambiguously specify an address, or can include a lesser number. For example, in a processor that specifies addresses using 32 bits, TA fields 424 and 426 may be 32 bits wide or less than 32 bits wide. Using 32 bits will unambiguously specify the address, but will also take up storage space. In some embodiments, TA fields 424 and 426 include fewer than the total number of bits, and introduce a small amount of ambiguity in exchange for reduced size.

[0044] When fewer than the total number of bits is used, a matching record may correspond to a branch instruction that is aliased to the current IP value. For example, an instruction that is not at a block entry or a block exit may cause a match in PTB 210 if the subset of bits used to specify TA field 424 matches. In some embodiments, an additional pipeline stage in pipeline 20 (FIG. 2) is used to check for a full address match to check for this condition.
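
The aliasing that partial address fields introduce, and the later full-address check, can be illustrated as follows. The sketch is a software model under stated assumptions: the 12 bit tag width, the class name `PartialTagPTB`, and the choice to keep the full source address alongside the record (as a stand-in for whatever the additional pipeline stage compares against) are all hypothetical.

```python
TAG_BITS = 12  # assumed partial tag width; the text gives no specific value


def partial_tag(address, bits=TAG_BITS):
    """Keep only the low-order `bits` bits of an address."""
    return address & ((1 << bits) - 1)


class PartialTagPTB:
    def __init__(self, bits=TAG_BITS):
        self.bits = bits
        self.records = {}  # partial tag -> (full source address, target)

    def install(self, source, target):
        self.records[partial_tag(source, self.bits)] = (source, target)

    def lookup(self, ip):
        """Fast lookup on the partial tag only; may return an aliased hit."""
        record = self.records.get(partial_tag(ip, self.bits))
        return None if record is None else record[1]

    def verified_lookup(self, ip):
        """Model of the later full-address check that rejects aliases."""
        record = self.records.get(partial_tag(ip, self.bits))
        if record is None or record[0] != ip:
            return None
        return record[1]


ptb = PartialTagPTB()
ptb.install(0x00401234, 0x00402000)
print(ptb.lookup(0x00781234))           # aliased hit on the low 12 bits
print(ptb.verified_lookup(0x00781234))  # rejected by the full-address check
```

Both example addresses share the low-order bits 0x234, so the fast lookup cannot tell them apart; only the full comparison distinguishes them.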

[0045] BTB 205 and PTB 210 are populated with records as branches are encountered during the execution of the software. When a new branch is taken, a new record is entered in BTB 205, and the branch address and target address are filled in. For each branch IP installed in BTB 205, the target address is also installed in a new record in PTB 210. In some embodiments, a parentheses matching state machine is employed to capture the entrance IP of a block before the exit IP of the same block is installed in BTB 205. When the BTB record is installed, the corresponding PTB record can also be installed. For example, a parentheses matching state machine can record the location of “a.in” when it is encountered, and leave the parentheses “open.” When “a.out” is encountered, and control branches to “b.in,” the state machine “closes” the parentheses, and the PTB record that maps “a.in” to “b.in” can be installed at the same time as the BTB record that maps “a.out” to “b.in.” If an exception occurs when the parentheses matching state machine is “open,” the PTB record may never be installed. In this case, a BTB record will exist without a corresponding PTB record.
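
The parentheses matching behavior can be sketched as a small state machine. This is an illustrative model only: the class and method names are hypothetical, and the choice to reopen the parentheses at the branch target (since the target is the entrance of the next block) is an assumption rather than something the text states.

```python
class RecordInstaller:
    """Model of a parentheses matching state machine that installs
    BTB and PTB records together when a block is closed."""

    def __init__(self):
        self.btb = {}           # exit IP -> target entrance IP
        self.ptb = {}           # entrance IP -> subsequent entrance IP
        self.open_entry = None  # entrance IP of the currently open block

    def enter_block(self, entry_ip):
        self.open_entry = entry_ip            # "open" the parentheses

    def taken_branch(self, exit_ip, target_ip):
        self.btb[exit_ip] = target_ip
        if self.open_entry is not None:
            # "Close" the parentheses: install the matching PTB record.
            self.ptb[self.open_entry] = target_ip
        self.open_entry = target_ip           # assumed: target opens next block

    def exception(self):
        self.open_entry = None                # open entrance is lost


installer = RecordInstaller()
installer.enter_block("a.in")
installer.taken_branch("a.out", "b.in")
installer.exception()                         # interrupts block b
installer.taken_branch("b.out", "c.in")
print(installer.btb)
print(installer.ptb)
```

After the exception, the BTB gains the record for "b.out" but no PTB record is installed for "b.in", which reproduces the case of a BTB record without a corresponding PTB record.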

[0046] FIG. 4B shows a branch target buffer and a presbyopic target buffer in accordance with another embodiment of the invention. Embodiment 440 includes BTB 205 and PTB 450. BTB 205 accepts the current IP value on node 202, and also accepts a branch address from PTB 450 on node 460. In embodiment 440, PTB 450 has records that map target addresses (TA) 452 to branch addresses (BA) 454. TA 452 and BA 454 correspond to entrance IPs and exit IPs of blocks.

[0047] As shown in FIG. 4B, PTB 450 maps block entrance IPs to block exit IPs. For example, the first record in PTB 450 maps the location of instruction “b.in” to the location of instruction “b.out.” In embodiment 440, the combination of BTB 205 and PTB 450 can be recursively searched. For example, when node 202 has the location of instruction “b.out” impressed thereon, BTB 205 finds a matching record, and drives node 423 with the location of instruction “c.in.” PTB 450 performs a search of TA fields 452, finds a matching record, and drives node 460 with the location of instruction “c.out.” This process can continue to predict multiple subsequent dynamic blocks.
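
The alternating search of embodiment 440 can be sketched as follows. The model is illustrative only: the record contents follow the examples in the text, and the function name `alternating_search` and the `max_blocks` bound are hypothetical.

```python
# BTB 205 maps block exits to entrances; PTB 450 maps entrances to exits.
BTB = {"a.out": "b.in", "b.out": "c.in"}
PTB_450 = {"b.in": "b.out", "c.in": "c.out"}


def alternating_search(branch_ip, btb, ptb, max_blocks=4):
    """Alternate BTB and PTB lookups, collecting the entrance IP of
    each predicted subsequent block."""
    entrances = []
    current = branch_ip
    for _ in range(max_blocks):
        entrance = btb.get(current)   # exit IP -> next block entrance
        if entrance is None:
            break
        entrances.append(entrance)
        current = ptb.get(entrance)   # entrance -> exit IP of that block
        if current is None:
            break
    return entrances


print(alternating_search("a.out", BTB, PTB_450))
```

Each pass through the loop corresponds to one BTB lookup followed by one PTB 450 lookup, with the PTB result fed back to the BTB as on node 460.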

[0048] FIG. 5 shows a prefetch stream buffer. Prefetch stream buffer (PSB) 250 includes instructions fetched as a result of subsequent blocks predicted by the action of a presbyopic target buffer, such as PTB 210 (FIG. 2). Each record in PSB 250 includes an instruction field 510 and a coloring field 520. Instruction field 510 holds prefetched instructions, and coloring field 520 serves to demarcate boundaries between blocks of instructions included within PSB 250. For example, entries 522 correspond to block “a,” shown in FIG. 3 as block 301. Entries 522 are shown having a value of “a” in field 520, thereby signifying entries 522 having instructions from block 301. Likewise, entries 524 have coloring field 520 values of “b,” entries 526 have coloring field 520 values of “c,” and entries 528 have coloring field 520 values of “f.” Each of these values corresponds to a different block in control flow graph 300 (FIG. 3).

[0049] In some embodiments, coloring fields 520 are assigned a sequentially allocated unique number for each block that is predicted and prefetched into PSB 250. The value of the block color can be produced with a shift register, with the least significant bit representing the prediction of the latest branch. In this manner, the color value assigned to coloring field 520 is similar to a fragment of global history. In some embodiments, a cache or other memory structure is employed to save past color history, and the characteristic signature branch IP is used to retrieve past color history to hint or guide future domino predictions. This can be used to bound the depth of domino prediction.
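
The shift register approach to producing color values can be sketched as follows. The register width of four bits and the function name `next_color` are assumptions for the illustration; the text does not specify a width.

```python
def next_color(color, taken, bits=4):
    """Shift the latest branch prediction into the least significant
    bit and discard bits beyond the (assumed) register width."""
    mask = (1 << bits) - 1
    return ((color << 1) | int(taken)) & mask


color = 0
for prediction in [True, False, True, True]:
    color = next_color(color, prediction)
print(format(color, "04b"))
```

Reading the result from left to right gives the predictions from oldest to newest, which is the sense in which the color resembles a fragment of global history.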

[0050] In some embodiments, coloring field 520 is represented by a finite number of bits, such that each possible field value represents a different block. If a branch is mispredicted, coloring field 520 can be used to flush or invalidate instructions on the mispredicted path. For example, if PSB 250 included instructions for block “e,” and block “d” was traversed instead, the instructions for block “e” could be identified within PSB 250 and flushed.
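
A flush keyed on the coloring field can be sketched as follows. The buffer is modeled as a list of (instruction, color) pairs; the instruction labels and the function name `flush_color` are hypothetical.

```python
# Each PSB entry pairs an instruction (field 510) with a color (field 520).
PSB = [("a0", "a"), ("a1", "a"), ("e0", "e"), ("e1", "e"), ("f0", "f")]


def flush_color(psb, color):
    """Return the buffer contents with every entry of `color` removed."""
    return [entry for entry in psb if entry[1] != color]


# Block "e" was predicted, but block "d" was traversed instead.
print(flush_color(PSB, "e"))
```

Entries for other blocks are left untouched, so correctly predicted instructions remain available after the flush.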

[0051] When disjoint eager prediction is performed, coloring field 520 can be assigned values such that mutually exclusive disjoint eagerly predicted and prefetched blocks are identified as such. As branch targets are resolved, blocks dependent on predicates compatible with the conditional code of the mispredicted branch can be flushed from PSB 250.

[0052] In some embodiments, PSB 250 is at least as long as the latency of FB 32 (FIG. 2), referred to as the Icache latency. When branch prediction by PTB 210 is correct, and PSB 250 has at least enough prefetched instructions to overcome the Icache latency, performance improves over a system with a branch target buffer alone. In embodiments capable of domino prediction, PSB 250 can be large enough to hold instructions from multiple subsequent dynamic blocks. In some embodiments, all of the instructions from the predicted blocks are prefetched, and in other embodiments, just enough instructions are prefetched from each predicted subsequent dynamic block to overcome the Icache latency.

[0053] FIG. 6 shows a series of function calls and returns. Embodiment 600 includes software functions 610, 620, and 630. Instructions within software function 610 are prefixed with the letter “a,” instructions within software function 620 are prefixed with the letter “b,” and instructions within software function 630 are prefixed with the letter “c.” In the control flow shown in FIG. 6, function 610 starts at the instruction “a.in,” and continues until reaching instruction “a.call,” which calls software function 620. The next instruction executed is “b.in,” and execution continues in software function 620 until reaching instruction “b.call,” which calls software function 630. Software function 630 executes from instruction “c.in” to instruction “c.ret.” Instruction “c.ret” is a “return” instruction that causes execution to branch back to the calling point. As a result of the return instruction, execution branches from instruction “c.ret” to instruction “b.call+1,” which is one instruction location away from instruction “b.call.” Software function 620 returns in the same manner when execution branches from instruction “b.ret” to instruction “a.call+1.”

[0054] In some embodiments, function returns, such as those caused by instructions “c.ret” and “b.ret,” can be predicted in a manner similar to branch prediction described with reference to the preceding figures. For example, return instructions can be treated as block exits, and instructions occurring after call instructions can be treated as block entrances. One such embodiment is now explained with reference to FIG. 7.

[0055] FIG. 7 shows a return stack buffer and a presbyopic return stack buffer. Embodiment 700 shows return stack buffer (RSB) 710 and presbyopic return stack buffer (PRSB) 720. RSB 710 operates in a manner similar to BTB 205 (FIG. 4A). Each of records 712 includes a branch address (BA) field 714 and a target address (TA) field 716. Within RSB 710, BA field 714 holds the address of return instructions, and TA field 716 holds the address of instructions dynamically following the return instructions. For example, the first record of RSB 710 caches the address of instruction “b.call+1” as the address predicted to follow the address of instruction “c.ret.”

[0056] PRSB 720 includes records that map target addresses to target addresses. For example, the record shown in PRSB 720 predicts instruction “a.call+1” to follow instruction “b.call+1.” RSB 710 and PRSB 720 can be utilized together in a manner similar to embodiments 400 (FIG. 4A) and 440 (FIG. 4B) to predict blocks subsequent to a function return.

[0057] RSB 710 and PRSB 720 have been described with reference to function calls and returns, but are also applicable to jump target tables. The combination of RSB 710 and PRSB 720 can be used to map the entrance IP of a block to the jump target of the block, such that instructions at the next target block can be prefetched upon entrance into the current block that is ended by a jump.

[0058] It is to be understood that the above description is intended to be illustrative, and not restrictive. Many other embodiments will be apparent to those of skill in the art upon reading and understanding the above description. The scope of the invention should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

What is claimed is:
 1. A branch target prefetch apparatus comprising: a presbyopic target buffer configured to receive a presbyopic target buffer record, wherein the presbyopic target buffer record maps an entry location of a first code region to an entry location of a second code region; and a prefetch stream buffer configured to receive instructions from the second code region responsive to an instruction pointer encountering the entry location of the first code region.
 2. The branch target prefetch apparatus of claim 1 wherein the presbyopic target buffer is configured to receive the presbyopic target buffer record responsive to a branch instruction being encountered in the first code region, the branch instruction having a branch target address equal to the entry location of the second code region.
 3. The branch target prefetch apparatus of claim 2 further comprising a branch target buffer configured to receive a branch target buffer record that maps an address of the branch instruction to the entry location of the second code region.
 4. The branch target prefetch apparatus of claim 3 wherein the presbyopic target buffer is configured to receive a plurality of presbyopic target buffer records, and is further configured to be searched recursively.
 5. The branch target prefetch apparatus of claim 4 wherein the prefetch stream buffer is configured to receive instructions from a plurality of code regions responsive to a recursive search of the presbyopic target buffer.
 6. The branch target prefetch apparatus of claim 5 wherein the prefetch stream buffer is configured to differentiate between instructions such that instructions from different ones of the plurality of code regions can be invalidated.
 7. The branch target prefetch apparatus of claim 3 wherein the branch target buffer record includes a first confidence counter having a first number of bits, and the presbyopic target buffer record includes a second confidence counter having a second number of bits that is greater than the first number of bits.
 8. The branch target prefetch apparatus of claim 1 wherein the presbyopic target buffer record is configured to map the entry location of the first code region to entry locations of a plurality of second code regions.
 9. The branch target prefetch apparatus of claim 1 wherein a cache memory has a cache latency associated therewith, and the prefetch target buffer has a depth at least as deep as one cache latency.
 10. A processor comprising: a branch target buffer responsive to fetched instruction addresses, wherein the branch target buffer is configured to map branch instruction addresses to branch target addresses; and a presbyopic target buffer responsive to the branch target buffer, wherein the presbyopic target buffer is configured to map branch target addresses to subsequent branch target addresses.
 11. The processor of claim 10 further comprising: a stream buffer configured to receive instructions fetched from subsequent branch target addresses specified in the presbyopic target buffer.
 12. The processor of claim 10 wherein the presbyopic target buffer is configured to be recursively searched to predict a plurality of subsequent branch target addresses.
 13. The processor of claim 10 wherein the presbyopic target buffer implements skip-adjacent mapping.
 14. The processor of claim 10 wherein a complete branch target address is specified by a fixed number of bits, and the presbyopic target buffer includes mapping records that specify branch target addresses using less than the fixed number of bits.
 15. A processor comprising: a branch target buffer responsive to fetched instruction addresses, wherein the branch target buffer is configured to be searched for the fetched instruction addresses and corresponding branch target addresses; a presbyopic target buffer responsive to the branch target buffer, wherein the presbyopic target buffer is configured to be searched for subsequent dynamic blocks as a function of branch target addresses.
 16. The processor of claim 15 wherein the presbyopic target buffer is configured to map branch target addresses to subsequent dynamic block exit addresses.
 17. The processor of claim 16 wherein the branch target buffer is further responsive to subsequent dynamic block exit addresses from the presbyopic target buffer.
 18. The processor of claim 17 wherein the branch target buffer and presbyopic target buffer are configured to be searched recursively in combination.
 19. A processor comprising: a first fetch buffer configured to receive instructions prefetched from predicted branch target addresses; and a second fetch buffer configured to receive instructions prefetched from predicted subsequent blocks.
 20. The processor of claim 19 wherein the second fetch buffer includes a coloring field for each instruction included therein, such that each instruction included therein can be assigned a color.
 21. The processor of claim 19 wherein the second fetch buffer includes a subsequent block demarcation mechanism to distinguish prefetched instructions from different predicted subsequent blocks.
 22. The processor of claim 19 further including a branch target buffer having records that when populated, map branches to predicted branch targets.
 23. The processor of claim 22 further including a presbyopic target buffer having records that when populated, map predicted branch target addresses to predicted subsequent blocks.
 24. The processor of claim 23 wherein the presbyopic target buffer maps each predicted branch target address to a plurality of predicted subsequent blocks.
 25. The processor of claim 23 wherein the presbyopic target buffer is configured to be recursively searched.
 26. An instruction prefetch method comprising: in a first buffer that maps branch instruction addresses to block entry addresses, searching for a first buffer record having a branch instruction address that matches a current instruction address; when the first buffer record is found, searching a second buffer that maps block entry addresses to subsequent block entry addresses for a second buffer record having a block entry address matching the first buffer record; and when the second buffer record is found, prefetching instructions beginning at a subsequent block entry address included in the second buffer record.
 27. The method of claim 26 wherein prefetching comprises entering instructions into a stream buffer, the stream buffer having a coloring field for each instruction entered.
 28. The method of claim 26 further comprising: searching the second buffer recursively; and for each matching record found in the second buffer, each matching record having a corresponding subsequent block entry address, prefetching instructions from each of the corresponding subsequent block entry addresses.
 29. The method of claim 28 wherein prefetching comprises: entering instructions into a stream buffer, the stream buffer having a coloring field for each instruction entered; and assigning a different color to instructions fetched from different subsequent block entry addresses.
 30. The method of claim 29 wherein each recursive search represents a predicted branch, the method further comprising flushing from the stream buffer instructions prefetched as a result of a mispredicted branch.